Data Selection for IT Texts using Paragraph Vector
نویسندگان
چکیده
This paper presents an overview of the system submitted by the University of Hamburg to the IT domain shared translation task as part of the ACL 2016 First Conference of Machine Translation (WMT 2016). We have chosen data selection as a domain adaptation method. The filtering of the general domain data makes use of paragraph vectors as a novel approach for scoring the sentences. Experiments were conducted for English-German under the constrained condition.
منابع مشابه
Generative Paragraph Vector
The recently introduced Paragraph Vector is an efficient method for learning highquality distributed representations for pieces of texts. However, an inherent limitation of Paragraph Vector is lack of ability to infer distributed representations for texts outside of the training set. To tackle this problem, we introduce a Generative Paragraph Vector, which can be viewed as a probabilistic exten...
متن کاملAn Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification
Due to the exponential growth of electronic texts, their organization and management requires a tool to provide information and data in search of users in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...
متن کاملDocument Embedding with Paragraph Vectors
Paragraph Vectors has been recently proposed as an unsupervised method for learning distributed representations for pieces of texts. In their work, the authors showed that the method can learn an embedding of movie review texts which can be leveraged for sentiment analysis. That proof of concept, while encouraging, was rather narrow. Here we consider tasks other than sentiment analysis, provide...
متن کاملSentiment Analysis with Deeply Learned Distributed Representations of Variable Length Texts
Learning good semantic vector representations for phrases, sentences and paragraphs is a challenging and ongoing area of research in natural language processing and understanding. In this project, we survey and implement several deeplearning and deep-learning-inspired approaches and evaluate these algorithms on several sentiment-labeled datasets and analysis tasks. In doing so, we demonstrate n...
متن کاملDistributed Representations of Sentences and Documents
Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, “powerful,” “strong” and “Paris” are eq...
متن کامل